ABSTRACT


Although social media carries a voluminous flow of data, it is still possible to
create an effective system that detects malicious activities quickly and
provides situational awareness.

This thesis developed patterns for a probabilistic approach to identify malicious 
behavior by monitoring big data. We collected twenty-two thousand tweets from publicly 
available Twitter data and used them in our testing and validation processes. We 
combined deterministic and nondeterministic approaches to monitor and verify the 
system. In the deterministic part, we determined assertions by using natural language 
(NL) and associated formal specifications. We then specified visible and hidden 
parameters, which are used for subsequent identification of hidden parameters in Hidden 
Markov Model (HMM) techniques. In the nondeterministic part, we used probabilistic 
formal specifications with visible and hidden parameters, used in HMM, to monitor and 
verify the system. 

An important contribution of the work is that we specified some event patterns 
indicating malicious activities. Based on these patterns, we obtained output to indicate the 
possibility of each tweet being malicious. 
I. INTRODUCTION


A. MOTIVATION 

1. Social Media and Twitter 

The Internet has become indispensable in our daily lives. It allows us to create 
and maintain communication and collaboration; through the use of social media and 
Twitter we can share everything from special milestones to routine daily activities. 

According to statistics [1], there are nearly 3.010 billion active Internet users, 
which is nearly 41% of the world population, and 2.078 billion active social media 
accounts. The annual growth is 21% for Internet users and 12% for active social media 
accounts (Figures 1 and 2). 



[Figure: Global digital snapshot of the world's key digital statistical indicators. Panels: total population (urbanisation 53%), active Internet users (penetration 42%), active social media accounts (penetration 29%), unique mobile users (penetration 51%), active mobile social accounts (penetration 23%).]




Figure 1. Snapshot of the World's Key Digital Statistical Indicators. Source: [1]. 
[Figure: Year-on-year growth over the past 12 months. Panels: total population, active Internet users, active social media accounts, unique mobile users, active mobile social accounts.]

Figure 2. Annual Growth of the Digital World. Source: [1].


Twitter is a social networking service on the Internet. Twitter allows registered 
users to compose and send short messages of up to 140 characters, called tweets, to 
provide interconnection between users [2]. It also allows unregistered users to read tweets 
sent by others. We focus on Twitter in this thesis because it is one of the biggest
social media sites, with 645,750,000 registered users [3], and because its public
tweets are available for data mining.

2. Malicious Users and Tweets 

In the modern world, we face a new generation of terrorists, such as ISIS, who
use social media, especially Twitter, to recruit new members. Propagandists can spread
up to 200,000 tweets per day, using the platform as a propaganda tool with videos and
pictures to feed the bad feelings of their supporters, spread fear among innocent people, and
trigger a series of lone-wolf terrorist attacks on orders from a terrorist leader on another
continent [4].

Social media is a convenient place for malicious users for several reasons: 

1. Detection is easily avoided. 

2. Many people can be accessed with little effort. 

3. Users in social media can share personal information such as credit 
cards, passwords, and private data. 

4. It is easy to manipulate people by using popular contents, 
honeypots, and familiar accounts. 

According to [5], Twitter suspended nearly 125,000 accounts in the seven
months leading up to February 2016 due to terrorist acts. With newly recruited staff,
Twitter has been reducing the response time for detecting malicious users and tweets and
for suspending their accounts. However, Twitter said that hunting terrorists on the web is not so
simple since there is no “magic algorithm” to detect terrorist content.

In [6], Berger and Morgan give exhaustive information about the features of
ISIS-supporter accounts, their tweeting patterns, and other Twitter metrics. Some of
these metrics can be used in this thesis to specify malicious users’ patterns:

1. For about one out of five ISIS-supporter accounts, the primary language is 
English (73% Arabic, 18% English, and 6% French). It is common for the 
tweet to have an English hashtag with Arabic content. 

2. The average number of followers of these accounts is about 1,004. A
typical user has 208 followers, while celebrities have millions. ISIS
supporters, therefore, have more followers than ordinary users, but it is
very unlikely to see an ISIS supporter with more than 20,000 followers.

3. The top locations of the accounts are Syria, Iraq, and Saudi Arabia. The 
location information of the accounts is very important for detecting these 
accounts. However, only a few users have enabled the location features. In 
addition, some of the users are using other applications to distort the GPS 
coordinates in creating the tweet. 
4. Botnets and applications have been used to propagate a large number of
tweets with specific content.


5. According to the statistics, nine out of ten ISIS-supporter accounts
follow fewer than 1000 users. This is a weaker indicator than follower
information, but it can be useful to increase precision [6].

Based on [6], we make the following assumptions to help us identify malicious
tweets:

1. The owners of the regular tweets have longer account lives than malicious 
users who tend to close their accounts for reasons such as hiding their real 
identity or accessing different users. Moreover, malicious accounts can be 
closed by Twitter after malicious behaviors have been detected. 

2. The users spreading malicious tweets have more friends than followers 
since they want to access more people; however, they are not known by 
others in the network. 

3. The accounts do not enable the geo location. 

4. While a high number of followers (>20,000) indicates celebrities, a low 
number (<500) indicates regular users. Suspicious users are usually 
between them. 

5. Malicious users follow fewer than 1000 users.

B. PURPOSE OF THE STUDY 

Social media is a rapidly growing and changing space. It is a data pool, in that a 
wide variety of information can be captured as long as one knows how. This situation 
encourages us to adopt a new technique for the detection of security-related malicious 
activities. In this thesis, we will use a hybrid of probabilistic and formal methods to 
detect malicious activities. 

C. THESIS STRUCTURE 

Chapter I provides the motivation for this thesis. It explains the importance of 
social media in our lives. Chapter II provides background information on formal 
methods, formal specifications, the Hidden Markov Model, runtime monitoring, and 
verification. This chapter is important for understanding a new system based on these 
steps of the development cycle. Chapter III describes the steps for collecting, filtering,
and reforming the data acquired from Twitter. It provides useful information for those
who want to mine Twitter data, and presents the natural language assertions and
corresponding rule patterns. It then describes the steps performed using screenshots from
the toolset. The last chapter, Chapter IV, addresses thoughts and implications regarding
the new technique.
II. BACKGROUND


A. NATURAL LANGUAGE REQUIREMENTS 

The typical software development process consists of requirements, design, and
implementation stages. Before applying any type of formal method, software
developers try to express what they understand of the requirements and
environmental arguments using natural language. Requirements are essential for a
proposed system, since the system development cycle is cumulative. If a
developer misses a requirement, the resulting error can be costly to fix. When
stakeholder and system needs are specified as requirements in natural language, system
developers should translate them into formal specifications to solve the problem in a
systematic manner [7].

One may ask, “If we need to turn requirements into formal specifications, why are 
we using natural language in the initial stage?” (As done in this project.) The answer is 
that natural language is necessary because stakeholders, customers, or prospective users 
probably do not understand the formal specifications. Another reason is that nobody 
wants to sign a contract written in formal specification language [8]. So, natural language 
specification is vital for ensuring that everyone is on the same page. 

B. FORMAL METHODS AND FORMAL SPECIFICATIONS 

The growing use of software-intensive systems has increased the complexity of 
software development. This complexity multiplies the likelihood of errors and increases 
development cost. The major goal of software development is to develop reliable, 
efficient software-intensive systems despite the growing complexity. At this point, formal 
methods (FM) depending on a well-designed mathematical structure are very useful for 
solving the problem and providing precise implementations. FMs are a general collection 
of techniques including formal specification (FS) and program verification [9]. 

FS is a technique based on a mathematical structure. It consists of syntax for 
grammatical rules and semantics for interpretation [10]. The purpose of the FS is to 
clearly represent a cognitive or natural language requirement and make it easy to monitor
system behaviors [11]. 

The use of formal methods and specifications has changed over time as a result of 
the growing complexity of systems. Because full formalization of systems has become 
more expensive and difficult, some scientists have proposed lightweight formal methods 
focused on partial specification and application [12]. 

Lightweight formal methods are very useful for reducing development cost. 
Moreover, FSs and lightweight FMs improve the clarity and precision of specified 
requirements [13]. Runtime execution monitoring (REM) is a lightweight FM that
monitors the behavior of a running system and can detect improper behavior and
requirements at an early stage of development. Debugging the requirements and detecting
errors early in the design process prevent extra cost and time.

In [13], Drusinsky presents a new type of formal method using a combination of 
runtime monitoring, execution-based model checking, and UML-based formal 
specifications with statechart assertions that provide unambiguous, clear, and visual 
presentation of the model. This new formalism uses deterministic/nondeterministic 
statechart assertions as its specification language [13]. Figure 3 shows a statechart 
assertion in which there is a start state, final state, event state, timer, and transitions 
between them. 



Figure 3. A Statechart Assertion. 
C. DETERMINISTIC RUNTIME VALIDATION AND VERIFICATION

The correctness of a system is directly related to the validation and verification of 
the system. Verification checks whether the system produces correct results given its 
input. It does so by comparing the expected output to the actual output. If there is an 
inconsistency, it means that verification failed. Validation asks whether the system
meets its intended purpose: is it the right product to build? While the verification
process focuses on internal aspects such as the behaviors of the system, the validation
process checks the overall system as a final product and compares it to the intended product

m. 

D. FORMAL VERIFICATION AND VALIDATION TRADEOFF SPACE 

As Drusinsky et al. pointed out in [7], there are cost and coverage space tradeoffs
for verification and validation (V&V) techniques. We describe the tradeoffs by using
two 3-dimensional (3D) cuboids whose dimensions are specification/validation,
program/implementation, and verification. Figures 4 and 5, respectively, represent the
coverage space and cost space tradeoffs of V&V techniques.

Three types of V&V techniques are: theorem proving (TP); model checking or 
property checking (MC); and execution-based model checking (EMC) combining 
runtime verification (RV) with automatic test generation (ATG). 

1. Theorem Proving 

Higher Order Logic (HOL) and the Stanford Temporal Prover (STeP) are examples of
tools that use TP. TP employs mathematics-based proof techniques to provide a persuasive
argument that a program complies with its requirements. This technique
requires a human driver to solve the underlying problem, which is generally undecidable.

With respect to cost and coverage tradeoffs, an important aspect of TP is that it
requires a human operator whose required skill level depends on the choice of
specification language [14]. In TP, specification languages such as temporal logic are
generally difficult to understand and use. These languages are hard to apply since
they differ from the languages that software programmers use, and it is difficult to
validate formal specifications with limited knowledge of temporal logic syntax. TP
therefore has low specification coverage and high specification cost. TP also depends on
programming languages that are inconsistent in many ways with existing Java or C++
applications, so it has low program coverage and high implementation cost in
the program/implementation dimension. While the expertise that TP requires
causes high verification cost, the involvement of well-trained, skilled users provides high
verification coverage [7].

2. Model Checking 

MC is an algorithmic formal verification (FV) technique that exhaustively checks the
state space of the model to determine whether it satisfies the given specifications.
Published MC tools include the SPIN Model Checker and the Spatial Logical Model Checker
(SPML), which verify the correctness of distributed models.

Once a program is set up for MC, no expert human operator is needed. Contrary
to TP, MC therefore does not depend on human expertise and has a lower verification cost
than TP. Both TP and MC have text-based specifications, which makes
visualization and validation difficult for the designer. Consequently, MC has low
specification coverage, similar to TP [7]. With respect to the program/implementation
dimension, the most vulnerable point of this verification technique is the state-space
explosion (a.k.a. combinatorial explosion) problem, which typically
causes the verification process to be overwhelmed by an exponentially growing state space
[15]. Overall, while the technique has low coverage and high cost in the
program/implementation dimension, it has high coverage and low cost in the verification
dimension because of its automatic verification and full coverage of components.

3. Execution-Based Model Checking (EMC) 

Runtime verification (RV) and automated test generation (ATG) are components
of EMC. RV tools include DBRover and StateRover, which use statechart
diagrams as their specification language. StateRover also has an automatic test generator.
In runtime verification, the system is monitored while it runs, with tests generated
by ATG. With this technique, the UML-based StateRover specification language, a
dynamic, lightweight V&V tool, can be used. It is automated and can cope with large,
complicated systems. Hence, it has high coverage and low cost for the specification and
program/implementation dimensions. Reliance on ATG is the one weakness of this
technique; it is possible to miss a violation for which ATG cannot generate a suitable test. So,
ATG provides low cost and intermediate coverage for the verification dimension [15].
Although RV has less coverage than conventional verification techniques like MC and
TP, it spares us the complexity of the verification process [16].



Figure 4. Coverage Space. Source: [7]. 

Figure 5. Cost Space. Source: [7].

Comparing the three V&V techniques, EMC has the lowest implementation,
verification, and specification costs and the highest program and
specification coverage.

Note that we do not verify any software systems in our thesis. We use a powerful 
FS language with runtime monitoring in EMC since runtime monitoring satisfies our 
needs for situational awareness, which is a prerequisite for stable and reliable systems. 
The method ensures more consistent and trustworthy results for categorization of tweets 
in a running system. 

E. HIDDEN MARKOV MODEL (HMM) 

Markov models are stochastic models used to describe randomly changing systems.
They have a list of possible states, and each present state has one or more possible
future states. There are four main Markov models used in various problem areas [17]:
the Markov chain, the Markov decision process, the HMM, and the partially observable
Markov decision process (Table 1).
Table 1. Markov Models. Source: [17].

                        System state is fully observable    System state is partially observable
System is autonomous    Markov chain                        Hidden Markov model
System is controlled    Markov decision process             Partially observable Markov decision process


In simpler Markov models such as the Markov chain, there is only one 
parameter—“state transition probabilities”—and states are fully observable. In HMM, the 
states are not fully visible and each state has possible observations, state transition 
probabilities, and output probabilities [18]. 



Figure 6. Markov Model. Source: [18]. 


In Figure 6, X, y, a, and b indicate states, possible observations, state transition 
probabilities, and observation probabilities, respectively. 

An HMM is very similar to a state machine: both have states and transitions
between states. Each transition between states and each observable of a state is
assigned a probability value between 0 and 1. The model determines one of the possible
outputs by looking at the sequence of observables [19]. In our thesis, the hidden states of
the model are the categorizations of the tweets. We will look for patterns, which appear as
rules in Chapter III, to flag interesting sequences of tweets.
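To make the roles of the state transition probabilities and the observation probabilities concrete, the following is a minimal sketch of Viterbi decoding for an HMM whose hidden states are the three tweet categories. It is an illustration only: the probability values, the observation symbols ("low", "mid", "high", an assumed discretization of follower counts), and the function name viterbi are hypothetical and are not the model or toolset used in this thesis.

```python
# Hypothetical HMM for illustration only: all probabilities below are invented,
# not values learned in the thesis.
STATES = ["B", "S", "M"]  # benign, suspicious, malicious (hidden states)

# START[s]: probability that the first hidden state is s
START = {"B": 0.8, "S": 0.15, "M": 0.05}

# TRANS[i][j]: probability of moving from hidden state i to hidden state j
TRANS = {
    "B": {"B": 0.85, "S": 0.10, "M": 0.05},
    "S": {"B": 0.30, "S": 0.50, "M": 0.20},
    "M": {"B": 0.10, "S": 0.30, "M": 0.60},
}

# EMIT[i][o]: probability that hidden state i produces observation o
EMIT = {
    "B": {"low": 0.70, "mid": 0.20, "high": 0.10},
    "S": {"low": 0.30, "mid": 0.60, "high": 0.10},
    "M": {"low": 0.15, "mid": 0.80, "high": 0.05},
}


def viterbi(observations):
    """Return the most likely hidden-state sequence for a non-empty observation list."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path probability.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(viterbi(["low", "mid", "mid", "mid", "low"]))
```

The decoder keeps, for every hidden state, only the best path that ends there, so its cost grows linearly with the number of observed tweets.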

F. RUNTIME MONITORING AND VERIFICATION OF SYSTEMS WITH
HIDDEN DATA

Runtime monitoring (RM) is a technique for observing runtime system behavior. 
While doing so, it detects formal specification violations. 

When RM and RV are applied to complex systems, required information such as
the existence of malicious email or tweets is not fully observable. In [20], Drusinsky
presents an RM technique that can be implemented in systems that include hidden
events. The technique uses UML-based statechart assertions; it combines HMM and RM
of formal specification assertions [20].

In this study, we combine HMM with RM of statechart assertions, where the
HMM is used to categorize tweets as malicious, suspicious, or benign.

The flow chart of a system using RM with hidden data processing is shown in 
Figure 7. 
Figure 7. Flow Chart. Adapted from [20].

III. RUNTIME MONITORING OF TWEETS


A. COLLECTING AND FILTERING DATA 

The Twitter Streaming API was used to collect data from Twitter. First, we
created an account and generated tokens to be used as user credentials. Figure 8 shows
the Python code used for downloading tweets [21]. We assigned the unique user
credentials provided by Twitter to the variables access_token, access_token_secret,
consumer_key, and consumer_secret.


# Code modified from "Connecting to Twitter Streaming API and downloading data" code
# obtained from http://adilmoujahid.com/posts/2014/07/twitter-analytics/
import json
import time
import pandas as pd
import matplotlib.pyplot as plt

# Import the necessary methods from the tweepy library
# (tweepy 3.x API; StreamListener was merged into Stream in tweepy 4.0)
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

# Variables that contain the user credentials to access the Twitter API
# (values are omitted in the source)
access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""


class listener(StreamListener):
    def on_data(self, data):
        # Print each raw JSON message received from the stream
        print(data)
        return True

    def on_error(self, status):
        print(status)
        return True  # do not kill the stream

    def on_timeout(self):
        return True  # do not kill the stream


auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
twitterStream = Stream(auth, listener())

# Bounding box covering the whole globe, restricted to English tweets
twitterStream.filter(locations=[-180, -90, 180, 90], languages=['en'])


Figure 8. Python Code to Download Data from Twitter. Source: [21]. 


We collected 22,000 tweets from publicly available Twitter data in the JSON data
structure. In this form, the data is not reader-friendly and is unsuitable for validation and
formal specifications. Moreover, there are too many features for each tweet. Hence, it is
necessary to filter them for specific features of interest, such as when the user account
was created, the number of followers and friends, the number of retweets, text, and so on.
The data was converted into CSV format and filtered by using the Google Refine tool,
which is a powerful tool for messy data. Google Refine can filter the data and transform
it to another format. Figure 9 shows a snippet of data including columns and records in
Google Refine.
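The conversion from raw JSON to a filtered CSV can also be scripted. The following is a minimal sketch using only the Python standard library; it is not the Google Refine procedure used in this thesis. It assumes one JSON message per input line, as the streaming code prints, and the function names flatten and tweets_to_csv are hypothetical. The nested field names follow the tweet objects of the Twitter v1.1 API, and the output column names follow Figures 9 and 10.

```python
import csv
import io
import json

# Output columns, named as in the Google Refine snippet and the data columns table.
FIELDS = [
    "tweet_created_at", "tweet_id", "text", "retweeted",
    "user_id", "user_created_at", "followers_count", "friends_count",
    "user_verified", "geo_enabled",
]


def flatten(tweet):
    """Pick the features of interest out of one decoded tweet object."""
    user = tweet.get("user", {})
    return {
        "tweet_created_at": tweet.get("created_at"),
        "tweet_id": tweet.get("id"),
        "text": tweet.get("text"),
        "retweeted": tweet.get("retweeted"),
        "user_id": user.get("id"),
        "user_created_at": user.get("created_at"),
        "followers_count": user.get("followers_count"),
        "friends_count": user.get("friends_count"),
        "user_verified": user.get("verified"),
        "geo_enabled": user.get("geo_enabled"),
    }


def tweets_to_csv(lines, out):
    """Write one CSV row per tweet; return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        record = json.loads(line)
        if "text" not in record:
            continue  # skip non-tweet messages such as limit or delete notices
        writer.writerow(flatten(record))
        count += 1
    return count


if __name__ == "__main__":
    sample = [json.dumps({"created_at": "Sat Aug 01 06:13:07 +0000 2015", "id": 1,
                          "text": "The blue moon is so cool", "retweeted": False,
                          "user": {"id": 2, "followers_count": 10, "friends_count": 5,
                                   "verified": False, "geo_enabled": True}})]
    buffer = io.StringIO()
    tweets_to_csv(sample, buffer)
    print(buffer.getvalue())
```

Keeping the column list in one place makes it easy to add or drop features before the data is loaded into a validation tool.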







[Figure: Google Refine screenshot listing 9829 records, 10 per page, with the columns tweet_created_at, tweet_id, text, retweeted, user_id, and user_created_at.]

Figure 9. A Snippet of Data in Google Refine. 


B. MEANING OF THE DATA COLUMNS 

We used the variables shown as columns in Figure 10. The meaning of each 
column is defined in Table 2. 


user_created_at  text                                     followers_count  friends_count  user_verified  geo_enabled
2009-03-13       Super Lol https:\/\/t.co\/lrj3JaAEjn     2946             1506           FALSE          TRUE
2009-04-18       @BluthX @joanwalsh How is stating fa     219              740            FALSE          TRUE
2009-06-03       23.1 use to be the captain of SCTCC LYV  204              149            FALSE          TRUE
2010-01-14       The seats are being filled ahead of the  265880           420            TRUE           TRUE
2010-05-15       @staceymurdoughl thanks chick! I'll ta   1226             828            FALSE          TRUE
2010-05-18       @The Gatorr get on damn lol              695              2              FALSE          FALSE
2010-07-12       @TT Sisters @LittleMeThatter @GaryB      204              149            FALSE          FALSE
2010-07-13       @Theominiking\nYou was absolutely        580              1              FALSE          FALSE
2010-07-19       talayelhttp                              935              24             FALSE          FALSE


Figure 10. A Snippet of Data Columns Used in Thesis. 
Table 2. Meaning of the Columns.

Column           Meaning
user_created_at  The date when the user account of the tweet is created.
text             The messages sent through the Internet; the main body of the tweet.
followers_count  The current number of followers for this account.
friends_count    The number of users that this account is following.
user_verified    If set to True, then Twitter has officially certified the user’s identity.
geo_enabled      If set to True, the user has enabled Twitter to access location information.


In Table 3, we classified all variables as alpha and beta columns since we use
different columns for some steps in our workflow. For example, the user_verified and
geo_enabled columns are used only in the learning phase of the workflow.


Table 3. Alpha and Beta Columns. 


Type           Column name      Stage of use
Alpha columns  followers_count  These columns are used in the learning phase for
               friends_count    determining the HiddenState column.
               user_verified
               geo_enabled
Beta columns   user_created_at  These columns are used in the R4B and RM phases.
               text
               followers_count
               friends_count


C. NATURAL LANGUAGE ASSERTIONS AND DETERMINISTIC RULE 
DEVELOPMENT 

The first step in developing an RM system for tweets is to specify
requirements by using natural language (NL) and determine corresponding formal
specifications (FS) [10]. We use Rules4business (R4B) for formal specifications. The
requirements in NL are expressed by patterns provided in R4B.

For determining malicious tweets, we made some assumptions based on Chapter I
(Motivation), as follows.

The owners of the regular tweets have a longer account life than malicious users 
who tend to close their account for reasons such as hiding their real identity or accessing 
different users. Moreover, malicious accounts can be deleted by Twitter. 

In [22], Shingh, Bansal, and Sofat point out that malicious users reach a large
number of friends in a short time and use popular links to spread their tweets faster and
attract attention. Famous users in Twitter, by contrast, have a larger number of
followers and a smaller number of friends. Therefore, we can assume that a large number
of followers (>20,000) indicates celebrity status, whereas low numbers (<500) indicate
regular users. Suspicious users usually fall between the two values in number of
followers. In addition, malicious users follow fewer than 1000 users
(friends<1000).

According to our inferences from [4], [5], and [6], malicious users’ accounts are 
not verified by Twitter and the users often disable the geolocation option. Their account 
life is two years or less and they usually open new accounts before their current accounts 
are identified as malicious. 
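The assumptions above can be written down as simple checks on the data columns. The following is a minimal sketch, assuming the thresholds stated in the text; the function names indicators and indicator_count, the reference date passed as today, and the idea of counting how many indicators hold are hypothetical illustrations. This is not how the HiddenState column is determined in this thesis.

```python
from datetime import date


def indicators(user_created_at, followers_count, friends_count,
               user_verified, geo_enabled, today):
    """Return which of the stated assumptions hold for one tweet's account."""
    age_days = (today - user_created_at).days
    return {
        # Between regular users (<500) and celebrities (>20,000)
        "followers_between_500_and_20000": 500 < followers_count < 20000,
        # Malicious users follow fewer than 1000 users
        "friends_below_1000": friends_count < 1000,
        # Malicious accounts are not verified by Twitter
        "not_verified": not user_verified,
        # Malicious users often disable the geolocation option
        "geo_disabled": not geo_enabled,
        # Account life of two years or less (approximated as 730 days)
        "account_two_years_or_younger": age_days <= 2 * 365,
    }


def indicator_count(**columns):
    """Hypothetical summary: the number of indicators that hold."""
    return sum(indicators(**columns).values())


if __name__ == "__main__":
    # Column values taken from one row of the data snippet; the reference
    # date is an assumption made for this example.
    print(indicators(
        user_created_at=date(2010, 5, 18), followers_count=695, friends_count=2,
        user_verified=False, geo_enabled=False, today=date(2015, 8, 1),
    ))
```

Returning the individual indicators, rather than a single verdict, keeps each assumption visible and lets later stages decide how to weigh them.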

We use R4B to choose and customize our statechart assertions. R4B has two
different interfaces for customization of instances and validation, respectively. On the first
page, users select rules according to the NL assertions. They can create and edit
instances. On the second page, in order to validate assertions, users upload the spreadsheet
including the columns that each rule requires. These columns are our beta columns. We
create five rules; they are instances of R4B rules: Rule-1, Rule-3, Rule-9, Rule-19, and
Rule-21, as shown in Table 4. In Table 4, a pattern is what R4B calls a generic rule.
“P=HiddenState=="M" and friends_count<1000 and followers_count<20000 and
followers_count>500” means the event in which the tweet is marked as malicious, has
fewer than 1000 friends, and has between 500 and 20000 followers.
“HiddenState=="B"” and “HiddenState=="S"” indicate the tweets marked as
benign and suspicious, respectively. Figures 11 to 15 show the statechart assertions for
the generic rules, called patterns in Table 4.


Table 4. Instances of R4B Patterns. 


Rules4business

Rule-1
  Pattern:                   Flag whenever event P happens.
  Events and Limits&Bounds:  P=HiddenState=="M" and friends_count<1000 and
                             followers_count<20000 and followers_count>500
  Description:               Flag the tweet whenever there is a malicious (HiddenState)
                             tweet with <1000 friends and >500 followers and <20000
                             followers.

Rule-3
  Pattern:                   Flag whenever no event Q occurs between P and R.
  Events and Limits&Bounds:  Q=HiddenState=="B", P=HiddenState=="M",
                             R=HiddenState=="M".
  Description:               Flag the tweet whenever there is no benign tweet between
                             two malicious tweets.

Rule-9
  Pattern:                   Flag whenever some pair of consecutive E events is less
                             than time T apart.
  Events and Limits:         E=HiddenState==="S", Time bounds: T=4, Time units:
                             weeks
  Description:               Flag the tweet whenever two users have suspicious
                             (HiddenState) tweets and are created <4 weeks apart.

Rule-19
  Pattern:                   Flag whenever more than N events E within time T after Q.
  Events and Limits&Bounds:  E=friends_count<1000 and followers_count<20000 and
                             followers_count>500, Q=HiddenState=="M"
                             N=3, T=10
  Description:               Flag the tweet whenever there are >3 tweets with
                             <1000 friends and >500 followers and <20000 followers
                             within 10 days after the user of a malicious (HiddenState)
                             tweet was created.

Rule-21
  Pattern:                   Flag whenever event Q occurs >N times between some
                             pair of consecutive E events.
  Events and Limits&Bounds:  E=text.indexOf("http")>=0, Q=HiddenState==="S"
                             N=2 (count bounds)
  Description:               Flag when there are >2 suspicious tweets between any two
                             tweets including an http link.
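To show how such patterns behave on a stream of categorized tweets, the following is a minimal sketch of three of the rule instances in Table 4 written as plain Python checks. It reflects one reading of the natural language patterns and is not the R4B implementation, whose exact semantics may differ; the function names and the dictionary and tuple layouts are hypothetical.

```python
from datetime import datetime, timedelta


def rule1(tweet):
    """Rule-1: event P from Table 4 holds for this tweet."""
    return (
        tweet["HiddenState"] == "M"
        and tweet["friends_count"] < 1000
        and 500 < tweet["followers_count"] < 20000
    )


def rule3_flags(states):
    """Rule-3: indices of malicious tweets that follow an earlier malicious
    tweet with no benign tweet in between (P = R = "M", Q = "B")."""
    flags = []
    armed = False  # True once a P has been seen and no Q has occurred since
    for i, state in enumerate(states):
        if state == "M":
            if armed:
                flags.append(i)
            armed = True
        elif state == "B":
            armed = False
    return flags


def rule9_flags(events, limit=timedelta(weeks=4)):
    """Rule-9: indices where two consecutive suspicious ("S") events are less
    than `limit` apart. `events` is a time-ordered list of (timestamp, state);
    which timestamp column to use is left to the caller."""
    flags = []
    previous = None  # timestamp of the most recent "S" event
    for i, (stamp, state) in enumerate(events):
        if state != "S":
            continue
        if previous is not None and stamp - previous < limit:
            flags.append(i)
        previous = stamp
    return flags


if __name__ == "__main__":
    print(rule1({"HiddenState": "M", "friends_count": 2, "followers_count": 695}))
    print(rule3_flags(["M", "S", "M", "B", "M", "M"]))
    print(rule9_flags([
        (datetime(2015, 1, 1), "S"),
        (datetime(2015, 1, 10), "B"),
        (datetime(2015, 1, 20), "S"),
        (datetime(2015, 3, 1), "S"),
    ]))
```

Each check keeps only a small amount of state (a flag or the last matching timestamp), which is what allows rules of this kind to be evaluated while the data is still arriving.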

[Figure: R4B screen for Rule 1, "Flag whenever event P happens", showing one instance (ID 41320161948.00368) with event P=HiddenState=="M" and friends_count<1000 and followers_count<20000 and followers_count>500.]

Figure 11. Instance of Rule-1 from R4B.


♦ Rule 3: Flag whenever no event Q occurs between P and R (1) 

Instances of this rule:

Instance ID: 23520162258.972187
Events: Q=HiddenState=="B", P=HiddenState=="M", R=HiddenState=="M"


Figure 12. Instance of Rule-3 from R4B. 
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♦ Rule 9 Flag whenever some pair of consecutive E events is less than time T apart (1) 



Instances of this rule:

Instance ID: 8352016234.931245
Events: E=HiddenState==="S"
Time bounds: T=4
Time units: weeks
Description: Flag the tweet whenever two users have suspicious (HiddenState) tweets and are created less than 4 weeks apart.
Silent: false


Figure 13. Instance of Rule-9 from R4B. 


♦ Rule 19: Flag whenever more than N events E within time T after Q (1)

Instances of this rule:

Instance ID: 183520162325.89062
Events: E=HiddenState==="M" and friends_count<1000 and followers_count<20000 and followers_count>500, Q=HiddenState==="S"
Count limits: N=3
Time bounds: T=10
Time units: days
Description: Flag the tweet whenever there are more than 3 tweets which are
Silent: false


Figure 14. Instance of Rule-19 from R4B. 
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♦ Rule 21 Flag whenever event Q occurs more than N times between some pair of consecutive E events (1) 




Instances of this rule:

Instance ID: 203520162342.92953
Events: Q=text.indexOf("http")>=0, E=HiddenState=="S"
Count limits: N=2
Description: Flag the tweet whenever there are
Silent: false


Figure 15. Instance of Rule-21 from R4B. 


D. VALIDATION OF ASSERTIONS IN RULES4BUSINESS 

We validated our assertions by uploading a file called the “validation spreadsheet.”
Before uploading this spreadsheet in csv or xlsx format (Figure 16), we specified the
column indexes as follows: “user_created_at=1, text=2, followers_count=3,
friends_count=4, HiddenState=5, time=1.” These indexes follow the column order of our
spreadsheet: the date of account creation, the text of the tweet, the number of
followers, the number of friends, and the state of the tweet. For example, the number of
followers is in the third column. The time column, which here is the same column as
user_created_at, provides the baseline (x-axis) of the flag timeline diagram.



[R4B upload page for the file 0_denemeR4B.xlsx, showing the message “Finished executing assertions using this data. You can now visualize rule behavior” and the column assignment ending in HiddenState=5, time=1.]






Figure 16. Naming the Columns in R4B. 
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In the validation phase, we use two different versions of validation spreadsheets 
as shown in Figures 17 and 18. Each validation spreadsheet consists of all columns used in
the R4B site and the HiddenState column. The two spreadsheets have different values in
order to induce different flags. The highlighted cells in Figure 18 show the differences
between the two spreadsheets, which lets us check the expected results as explained below.
Figures 17 and 18 depict the first 12 rows of each validation spreadsheet.


user_created_at | text | followers_count | friends_count | HiddenState
2009-03-13 | Super Lol https:\/\/t.co\/lrj3JaAEjn | 2946 | 1506 | S
2009-04-18 | @BluthX @joanwalsh How is stating fa | 219 | 740 | S
2009-06-03 | 23. I use to be the captain of SCTCC LYM | 204 | 149 | S
2010-01-14 | The seats are being filled ahead of the | 265880 | 420 | B
2010-05-15 | @staceymurdoughl thanks chick! I'll ta | 1226 | 828 | S
2010-05-18 | @The__Gatorr get on damn lol | 695 | 2 | M
2010-07-12 | @TT_Sisters @LittleMeThatter @GaryB | 204 | 149 | S
2010-07-13 | @Theominiking \nYou was absolutely | 580 | 1 | M
2010-07-19 | talaye!http | 935 | 24 | M
2010-07-19 | Most amazing moment of 2016. Discuss | 935 | 736 | M
2010-07-22 | @TRobinsonNewEra @lynbrownmp @ | 543 | 312 | S
2010-07-22 | You can stop it. Yes . Can stop it. | 326 | 2 | S

Figure 17. R4B Validation Spreadsheet Version 1 


user_created_at | text | followers_count | friends_count | HiddenState
2009-03-13 | Super Lol https:\/\/t.co\/lrj3JaAEjn | 2946 | 1506 | M
2009-04-18 | @BluthX @joanwalsh http How is stating | 219 | 740 | S
2009-06-03 | 23. I use to be the captain of SCTCC LYM C | 204 | 149 | S
2010-01-14 | The seats are being filled ahead of the st | 265880 | 420 | S
2010-05-15 | @staceymurdoughl thanks chick! I'll take p | 1226 | 828 | S
2010-05-18 | @The__Gatorr get http on damn lol | 695 | 2 | S
2010-07-12 | @TT_Sisters @LittleMeThatter @GaryBarl | 204 | 149 | M
2010-07-13 | @Theominiking \nYou was absolutely am | 580 | 1 | M
2010-07-19 | talaye!http | 935 | 24 | M
2010-07-19 | Most amazing moment of 2016. Discussing | 935 | 736 | M
2010-07-22 | @TRobinsonNewEra @lynbrownmp @Gro | 543 | 312 | S
2010-07-22 | You can stop it. Yes . Can stop it. | 326 | 2 | S

Figure 18. R4B Validation Spreadsheet Version 2 


According to the first validation spreadsheet, we expect Rule-1 to induce an RM
flag in rows 6, 8, 9, and 10. For Rule-3, rows 8 and 10 are expected to induce an RM flag.
For Rule-9, a single RM flag is expected in row 11, and we do not expect any
flags for Rule-19 and Rule-21.
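
The Rule-1 expectation can be checked by hand against Figure 17. The following Python sketch is illustrative only and is not the R4B implementation: the function name rule1_flags and the tuple layout are assumptions made for this example, while the row values are the first 12 rows of validation spreadsheet version 1.

```python
# Rows 1-12 of validation spreadsheet version 1 (Figure 17):
# (followers_count, friends_count, HiddenState)
ROWS_V1 = [
    (2946, 1506, "S"), (219, 740, "S"), (204, 149, "S"), (265880, 420, "B"),
    (1226, 828, "S"), (695, 2, "M"), (204, 149, "S"), (580, 1, "M"),
    (935, 24, "M"), (935, 736, "M"), (543, 312, "S"), (326, 2, "S"),
]


def rule1_flags(rows):
    """Return the 1-based row numbers at which event P of Rule-1 holds."""
    flagged = []
    for row_no, (followers, friends, state) in enumerate(rows, start=1):
        # P = HiddenState=="M" and friends_count<1000
        #     and followers_count<20000 and followers_count>500
        if state == "M" and friends < 1000 and 500 < followers < 20000:
            flagged.append(row_no)
    return flagged


print(rule1_flags(ROWS_V1))  # [6, 8, 9, 10]
```

Because Rule-1 has no memory ("flag whenever event P happens"), each row can be evaluated independently; the other rules need state carried from row to row.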
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We perturbed some values in the first validation spreadsheet (Figure 17) to 
create a second validation spreadsheet (Figure 18). For example, in row 6, we changed
the HiddenState value from malicious (M) to suspicious (S). This user has fewer than
1000 friends, and his/her number of followers is between 500 and 20,000. The row induced
an RM flag for Rule-1 in the first spreadsheet, but after changing the HiddenState value
to suspicious, it is expected not to induce an RM flag; indeed, it did not, as depicted in
Figure 19. Likewise, with the first spreadsheet, Rule-9 is expected not to induce an RM
flag in row 6 because it was not a suspicious tweet; indeed, it did not, as depicted in
Figure 20. However, after we changed the HiddenState value to suspicious, Rule-9 is
expected to induce an RM flag because rows 5 and 6 were created three days apart and
both are classified as suspicious; indeed, RM induced such a flag, as depicted in Figure 20.


This is the essence of the validation phase: to check that all rules induce an RM 
flag precisely when expected to do so. 
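
The Rule-9 perturbation can be illustrated on rows 5 to 7 of the two spreadsheets. This is a minimal Python sketch under stated assumptions, not the R4B or StateRover implementation: the function name rule9_flags is hypothetical, the comparison "less than T apart" is taken as a strict date difference, and the sketch does not model how the tool treats events that share the same date or that follow a flag.

```python
from datetime import date, timedelta


def rule9_flags(rows, t=timedelta(weeks=4)):
    """rows: list of (row_no, date, hidden_state). Return flagged row numbers."""
    flagged = []
    last_e = None  # date of the most recent E event (HiddenState == "S")
    for row_no, when, state in rows:
        if state != "S":
            continue  # not an E event, so it does not affect the pair
        if last_e is not None and when - last_e < t:
            flagged.append(row_no)
        last_e = when
    return flagged


# Rows 5-7 of validation spreadsheets version 1 and 2 (Figures 17 and 18).
V1 = [(5, date(2010, 5, 15), "S"), (6, date(2010, 5, 18), "M"), (7, date(2010, 7, 12), "S")]
V2 = [(5, date(2010, 5, 15), "S"), (6, date(2010, 5, 18), "S"), (7, date(2010, 7, 12), "M")]

print(rule9_flags(V1))  # []
print(rule9_flags(V2))  # [6]
```

In version 1, rows 5 and 7 are the consecutive suspicious rows and lie 58 days apart, so nothing is flagged; in version 2, rows 5 and 6 are three days apart, so row 6 is flagged.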


[Flag timeline diagrams for Rule-1 over the two validation spreadsheets, with row numbers and dates along the time axis. Annotation in the figure: "There is no flag for row 6 since we changed the HiddenState value from M to S." Event assignments: P=HiddenState=="M" and friends_count<1000 and followers_count<20000 and followers_count>500.]


Figure 19. Rule-1 Flag Timeline Diagram 
[Flag timeline diagrams for Rule-9 over the two validation spreadsheets, with row number, cycle, and date along the time axis. Annotation in the figure: "When we changed the HiddenState value to suspicious, this row flagged since it is just 3 days apart from row 5 which is also suspicious." Event assignments: E=HiddenState==="S". Timing bounds: T=4 weeks. Description: Flag the tweet whenever two users have suspicious (HiddenState) tweets and are created less than 4 weeks apart.]


Figure 20. Rule-9 Flag Timeline Diagram 


R4B presents a visualization of the behavior of each rule. Figure 21 presents this
visualization for Rule 9. In Figure 21, the upper left window shows the statechart diagram.
The lower left window presents the uploaded file, with the flagged tweet in row 11; flagged
tweets are displayed in red. The right window is the timeline diagram showing all flag and
non-flag states along the time axis. The tweet in row 11 is flagged since the tweets in
rows 7 and 11 are both suspicious and are less than four weeks apart.
[R4B screenshot with three windows: the Rule-9 statechart, the loaded csv table with the flagged row 11 (transaction date 07/22/2010; data-source event that fired: E=HiddenState=="S"), and the live timeline diagram for "Flag whenever some pair of consecutive E events is less than time T apart" (rule instance ID 8352016234.931245).]


Figure 21. Success Flag for Rule 9 in R4B 


E. STANDARD STATEROVER RULE CREATION AND CODE 

GENERATION 

The StateRover detects behavioral patterns by using deterministic UML-based
statechart patterns. It extends the statechart notation by combining statechart
diagrams, the Java action language, and the built-in Boolean flag bSuccess.

This thesis refers to the StateRover because it is part of the code generation
process: the code generator described in Section H works from code generated by the
StateRover. Therefore, we converted our R4B diagrams to StateRover diagrams. Readers who
do not need the details of the StateRover may skip to the next section.

In this phase, R4B diagrams are converted to StateRover diagrams. For each R4B
rule, we created a corresponding StateRover statechart assertion (Figure 22). Statechart
assertions start with an initial state. Events trigger the transitions between states. The
final state is the Flag state, which shows whether the assertion fails or succeeds. If the
StateRover reaches the Flag state, it assigns a false value to the bSuccess Boolean
variable, meaning that the assertion discovered a flagged scenario.



Figure 22. Rule-19 Statechart Diagram in StateRover 


In the StateRover, automatic code generation requires that only two arguments be
created. The first one is the events.properties file for each rule, such as
Rule19_events.properties. These files are simple text files; Figure 23 shows an example.
They contain the text that we already used in the R4B phase (see the Events and
Limits&Bounds sections in Table 5).


#Rule19_Events.properties
E=followers_count>500 and followers_count<20000 and friends_count<1000
Q=HiddenState=="M"
T=10 days
N=3


Figure 23. Rule19_events.properties File
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The StateRover implements a two-step process to perform validation for checking 
that the R4B diagrams are drawn accurately in the StateRover. In the first step, it
generates Java code when we save our statechart diagrams. In the second step, we execute
JUnit tests to ensure that each StateRover statechart assertion behaves the same as its
R4B counterpart. All JUnit sanity test code is in Appendix A.
Figure 24 shows that all sanity tests run correctly.
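
To make the idea of a sanity test concrete, the following Python sketch models the Rule-19 pattern ("flag whenever more than N events E within time T after Q") and replays the call sequence visible in the Rule19 sanity test of Figure 24. It is a hand-written approximation, not StateRover-generated code: the class name Rule19Sketch, the method names, and the choices that the window closes once more than T time units have elapsed and that a new Q restarts the count are all assumptions.

```python
class Rule19Sketch:
    """Flag whenever more than N events E occur within time T after Q."""

    def __init__(self, n=3, t=10):
        self.n, self.t = n, t
        self.now = 0            # current time, in the rule's time units
        self.q_time = None      # time of the most recent Q, if any
        self.e_count = 0        # E events counted since that Q
        self.b_success = True   # becomes False once the flag state is reached

    def incr_time(self, dt):
        self.now += dt
        # Once the window T after Q has elapsed, stop counting E events.
        if self.q_time is not None and self.now - self.q_time > self.t:
            self.q_time, self.e_count = None, 0

    def q(self):
        self.q_time, self.e_count = self.now, 0

    def e(self):
        if self.q_time is None:
            return  # an E outside any window after Q is ignored
        self.e_count += 1
        if self.e_count > self.n:
            self.b_success = False  # flag state reached


rule = Rule19Sketch(n=3, t=10)
rule.incr_time(1)
rule.q()
rule.incr_time(1)
rule.e()
rule.incr_time(1)
rule.e()
rule.e()
rule.e()
print(rule.b_success)  # False: four E events fell within T=10 after Q
```

A sanity test in this style asserts the value of the success flag after a scripted event sequence, once for a sequence that should flag and once for one that should not.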


[Eclipse JUnit view: Runs 10/10, finished after 0.154 seconds. Rule1.SanityTest, Rule3.SanityTest, Rule9.SanityTest, Rule19.SanityTest, and Rule21.SanityTest each run test and test1. The editor shows the following excerpt of Rules/src/Rule19/SanityTest.java, which ends where the screenshot is cut off.]

    package Rule19;

    //import static org.junit.Assert.*;
    import junit.framework.TestCase;

    public class SanityTest {

        Rule19 rule;

        @Before
        public void setup() {
            rule = new Rule19();
            rule.N = 3;
            rule.T = 10;
            rule.execTRreset(); // enforce the setting of T
        }

        @Test
        public void test() {
            rule.incrTime(1);
            rule.Q();
            rule.incrTime(1);
            rule.E();
            rule.incrTime(1);
            rule.E();
            rule.E();
            rule.E();
            rule.incrTime(8);
            rule.timeoutFire();


Figure 24. Sanity Tests Run 


F. LEARNING PHASE FOR HMM 

In the learning phase, we add a HiddenState column to our spreadsheet as
column 7. Each tweet is classified into one of three categories: malicious (M), suspicious
(S), or benign (B). To populate the learning-phase spreadsheet, we act as a tweet
classification expert. Table 5 defines the rule used for specifying the values of
HiddenState: a human operator determines each value from the followers_count,
friends_count, user_verified, and geo_enabled columns.
Figure 25 shows a snippet of our learning-phase spreadsheet. HMM creation and the
learning algorithm are explained in Section G.


Table 5. The Rule to Determine HiddenState in Learning Phase 


Columns | Observation | Action
followers_count | If the number of followers is between 500 and 20000 | add 1 to total
friends_count | If the number of friends is <1000 | add 1 to total
user_verified | If the user is not verified | add 1 to total
geo_enabled | If the account does not enable geolocation | add 1 to total

If the total is
4: assign M (malicious)
2-3: assign S (suspicious)
0-1: assign B (benign)


followers_count | friends_count | user_verified | geo_enabled | HiddenState
2946 | 1506 | FALSE | TRUE | S
219 | 740 | FALSE | TRUE | S
204 | 149 | FALSE | TRUE | S
265880 | 420 | TRUE | TRUE | B
1226 | 828 | FALSE | TRUE | S
695 | 2 | FALSE | FALSE | M
204 | 149 | FALSE | FALSE | S
580 | 1 | FALSE | FALSE | M
935 | 24 | FALSE | FALSE | M


Figure 25. The Population of the HiddenState Column 
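
The scoring rule of Table 5 is simple enough to write down directly. The sketch below is an illustration and not the procedure the authors ran (the thesis describes a human operator filling the column); the function name hidden_state is hypothetical, and treating "between 500 and 20000" as exclusive bounds is an assumption borrowed from the >500 and <20000 limits used in the rules.

```python
def hidden_state(followers_count, friends_count, user_verified, geo_enabled):
    """Score a row with the four observations of Table 5 and map the total to M, S, or B."""
    total = 0
    if 500 < followers_count < 20000:   # bounds taken as exclusive (assumption)
        total += 1
    if friends_count < 1000:
        total += 1
    if not user_verified:
        total += 1
    if not geo_enabled:
        total += 1
    if total == 4:
        return "M"                      # malicious
    return "S" if total >= 2 else "B"   # suspicious for 2-3, benign for 0-1


# The nine rows shown in Figure 25, with their recorded HiddenState.
FIGURE_25 = [
    (2946, 1506, False, True, "S"), (219, 740, False, True, "S"),
    (204, 149, False, True, "S"), (265880, 420, True, True, "B"),
    (1226, 828, False, True, "S"), (695, 2, False, False, "M"),
    (204, 149, False, False, "S"), (580, 1, False, False, "M"),
    (935, 24, False, False, "M"),
]

computed = [hidden_state(*row[:4]) for row in FIGURE_25]
print(computed == [row[4] for row in FIGURE_25])  # True
```

Under these assumptions the sketch reproduces every HiddenState value shown in Figure 25.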


G. GENERATING THE HIDDEN MARKOV MODEL (HMM) 

An HMM is a statistical model in which the state is not directly visible. However,
the observable, which depends on the state, is visible. The model determines one of the
possible outputs by looking at the sequence of observables [19]. It provides a way to
capture patterns that are essential for making more accurate decisions.
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In our thesis, the hidden states are the categorization of tweets. We already 
determined our hidden column according to the rules in the learning phase (Table 5). The
spreadsheet, including the populated hidden column and the other visible data used in R4B
rule instances, is our learning-phase spreadsheet (csv file). The learning-phase csv also
contains an indication of the initial state. The learning-phase spreadsheet separates data
into two types: visible data (friends_count, followers_count, text, and user_created_at)
and hidden data (HiddenState). Let v, h, and N represent visible data, hidden data, and
total number of rows, respectively. Also, let $h_i$ and $v_i$ be the values of the hidden
and visible columns in row i. As Drusinsky states in [23], our HMM learns the
probabilities as shown below:

• The probability of a transition between states is obtained by dividing the
number of occurrences of that specific transition by N-1 (the total number of
transitions in the spreadsheet). For example, suppose we have 40 transitions
from suspicious (S) to malicious (M) in h and N is 81. The HMM transition
probability for S->M is then #(S->M)/(N-1) = 40/80 = 0.5.

• Suppose that for a given row i, $v_i$ is k and $h_i$ is M. The probability of an
observable k being emitted in hidden state M is calculated by dividing
the number of rows that satisfy both k and M by N.

• The initial state probability distribution is the proportion of times each
hidden state is marked as an initial state. For instance, if two of the three
states are marked as initial states (equally often), then the initial state
probability distribution is [0.5, 0.5, 0].
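
The three counting rules above can be sketched in a few lines of Python. This is an illustration of the description in the text, not the HMM generator used in the thesis: the function name learn_hmm, the row layout (initial-state mark, observable, hidden state), and the reading of the initial distribution as the share of "Y" marks per hidden state are assumptions, and the 81-row sequence is synthetic data built only to reproduce the 40/80 example.

```python
from collections import Counter


def learn_hmm(rows):
    """rows: list of (initial_mark, observable, hidden_state) in spreadsheet order."""
    n = len(rows)
    hidden = [h for _, _, h in rows]

    # Transition probability: count of each h[i] -> h[i+1] pair divided by N-1.
    transitions = Counter(zip(hidden, hidden[1:]))
    trans_prob = {pair: c / (n - 1) for pair, c in transitions.items()}

    # Emission probability: rows showing observable k in state s, divided by N.
    emissions = Counter((obs, h) for _, obs, h in rows)
    emit_prob = {pair: c / n for pair, c in emissions.items()}

    # Initial distribution: share of initial-state marks carried by each state.
    marked = Counter(h for mark, _, h in rows if mark == "Y")
    total_marked = sum(marked.values())
    init_prob = {s: c / total_marked for s, c in marked.items()} if total_marked else {}

    return trans_prob, emit_prob, init_prob


# Synthetic sequence: S, M, S, M, ..., S (81 rows, hence 40 S->M transitions).
rows = [("Y" if i == 0 else "N", "k", "S" if i % 2 == 0 else "M") for i in range(81)]
trans_prob, emit_prob, init_prob = learn_hmm(rows)
print(trans_prob[("S", "M")])  # 0.5
```

Note that dividing by N-1 yields the share of all transitions rather than a probability conditioned on the source state; the sketch follows the text on this point.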

While R4B supports events such as followers_count<500, which draw on an infinite set
of possible observables, a typical HMM operates on a relatively small number of
observables. Therefore, we need to quantize our values into corresponding value ranges,
such as followers_countLT500 (the number of followers is less than 500). Table 6 indicates
how the columns user_created_at, friends_count, and followers_count are quantized. The
Python code for quantization is presented in Appendix B. Quantization enables our
toolset to map generic names like P, Q, and R.
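
As a rough illustration of this quantization step, the following Python sketch maps raw column values to the value names of Table 6. It is not the code of Appendix B: the function names are hypothetical, the handling of the boundary values 500, 1000, and 20000 is an assumption, "more than two years" is approximated as more than 730 days, and the reference date passed as today is a made-up value for the example.

```python
from datetime import date


def quantize_followers(n):
    if n < 500:
        return "followers_countLT500"
    if n <= 20000:                       # boundary handling is an assumption
        return "followers_count500to20000"
    return "followers_countGT20000"


def quantize_friends(n):
    if n < 500:
        return "friends_countLT500"
    if n <= 1000:                        # boundary handling is an assumption
        return "friends_count500to1000"
    return "friends_countGT1000"


def quantize_created(created, today):
    # "More than two years ago" approximated as more than 730 days (assumption).
    return "OLD" if (today - created).days > 730 else "NEW"


def compound_output(created, followers, friends, today):
    """Combine the three quantized values into one observable."""
    return (
        quantize_created(created, today),
        quantize_followers(followers),
        quantize_friends(friends),
    )


# First data row of the learning-phase csv, with a hypothetical reference date.
print(compound_output(date(2009, 3, 13), 2946, 1506, today=date(2016, 6, 1)))
# ('OLD', 'followers_count500to20000', 'friends_countGT1000')
```

With three values per count column and two per date column, the HMM sees at most 18 distinct compound observables instead of an unbounded range of integers and dates.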

The HMM generator needs a column named HiddenState. As described in
Section F, we played the role of an expert and filled in the cells of the HiddenState
column. A snapshot of the final learning-phase csv file is shown in Figure 26.
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Table 6. Quantization of Columns. 


Columns | Values | Description
user_created_at | new, old | If the account was created more than two years ago, this account is OLD. Otherwise it is NEW.
friends_count | friends_countLT500, friends_count500to1000, friends_countGT1000 |
followers_count | followers_countLT500, followers_count500to20000, followers_countGT20000 |



0_denemeHMM_IS_HS.csv

InitialState,user_created_at,text,followers_count,friends_count,HiddenState
Y,2009-03-13,Super Lol https:\/\/t.co\/lrj3JaAEjn,2946,1506,S
Y,2009-04-18,@BluthX @joanwalsh How is stating fact an opinion,219,740,S
Y,2009-06-03,23. I use to be the captain of SCTCC LYM Crew #reasonsforsteeltoeguest,204,149,S
Y,2010-01-14,The seats are being filled ahead of the start of the 2016 #MOGOAwards at the Alisa h
Y,2010-05-15,@staceymurdoughl thanks chick! I'll take pics X,1226,828,S
Y,2010-05-18,@The__Gatorr get on damn lol,695,2,M
Y,2010-07-12,@TT_Sisters @LittleMeThatter @GaryBarlow Very and the video is even cuter x,204,149,
Y,2010-07-13,@Theominiking \nYou was absolutely amazing on Theovision \ud83e\udd17\ud83e\udd17\u
Y,2010-07-19,talaye!http,935,24,M
Y,2010-07-19,Most amazing moment of 2016. Discussing my dissertation with Fatou Bensouda,935,736,
Y,2010-07-22,@TRobinsonNewEra @lynbrownmp @Gropeapanda @WestHamLabour This is an Islamic sign an
Y,2010-07-22,You can stop it. Yes . Can stop it .,326,2,S
Y,2011-02-09,To Six Flags we go!http,927,22,S
Y,2011-06-18,@Patsydogl And no one can beleeb I is 10.,695,2,M
Y,2011-08-06,Some people can vote dead Body! Aiyavedied #AMVCA2016,267,246,S
Y,2011-08-22,2-1 Iggy on shorthanded goal,962,7,M
Y,2011-08-22,@oufcoli @BurdonGeorge @OUFC_ no jobs !? \ud83d\ude02I'm a copper . Do believe I was
Y,2011-10-19,Need to be swooped to the Coliseum,1029,841,S
Y,2012-03-02,"\""if you look at this painting and all you can see are naked bodies ... the proble
Y,2012-07-18,all this chisme gots me like https:\/\/t.co\/wrmK3VBdbK,695,1019,S
Y,2012-09-06,\ud83d\udc97\n\n\ud83d\udc97\n\n\ud83d\udc97\n\n\ud83d\udc97\nI'VE NEVER GOTTEN A CA
Y,2012-09-22,@BleuSergy Would be interesting to visualise.,1581,273,M
Y,2012-09-22,Heading to top 3rd,326,246,S
Y,2012-11-12,@carriekovarik happy share some screen grabs of @MyDermPortal for your next presenta
Y,2012-12-15,It's just a party in Pittsburgh today isn't it,587,716,S
Y,2013-01-04,@blackzeusx and I pay tribute to the king. #Brooklyn #notorious #hiphop #music #trib
Y,2013-01-16,"@MeLikeGoodMusic thanks for the follow Go check out our single \""Keep It On The Lo

The first column of each row is Y; it is a special column indicating the initial state.


Figure 26. Learning-Phase CSV File. 


The last task in this step is to run a command that generates an hmm.json file
containing the HMM in JSON (JavaScript Object Notation) (Figure 27).
[Command Prompt (Komut Istemi) output of the HMM generator. For each learning-phase row it prints one line of the form

For state s seeing CompoundOutput <OLD,followers_countLT500,friends_countLT500>

where the state letter is m, s, or b and the compound output combines the quantized user_created_at, followers_count, and friends_count values. The run ends with:

output file is stored in: C:\users\zeyzeyim\Google Drive\new_begin\denemeler\hmm.json]



Figure 27. Command Prompt Run for Quantization 


H. GENERATING SPECIAL JAVA CODE FOR PROBABILISTIC 
RUNTIME MONITORING 

In this step, we create a new Java project that includes the automatically generated
Java code and sanity tests (Figure 28). The sanity tests validate that the generated Java
code runs correctly. Appendix C shows the sanity tests of the Rule1_DTRA, Rule3_DTRA,
Rule9_DTRA, Rule19_DTRA, and Rule21_DTRA Java files.
[Eclipse Package Explorer for the DTRA_Rules project. The src folder contains com.timerover.staterover.ifacesrc and the packages Rule1, Rule19, Rule21, Rule3, and Rule9; each rule package holds its DTRA Java file (for example, Rule1_DTRA.java) and a SanityTest.java file. The project also lists JRE System Library [JavaSE-1.8], JUnit 4, and Referenced Libraries.]



Figure 28. DTRA_Rules. 


The special Java code for probabilistic RM implements an algorithm described
in [23]. This algorithm uses an input sequence in the form of a list of two-tuples, such
as Input = $\{K_1,P_1\}, \{K_2,P_2\}, \{K_3,P_3\}, \ldots, \{K_n,P_n\}$. $K_i$ is either a
visible event (i.e., from the friends_count and followers_count columns) or a hidden one
(i.e., from the HiddenState column). $P_i$ is the probability of distribution (POD) of
$K_i$. The POD of a visible event is 1; the POD of a hidden event is taken from the
results of running the alpha method on the HMM (the Alpha Method in Section I).

As pointed out in [23], the run-time evaluation of an assertion consists of a
collection of objects called configurations. We label a collection as Col and a
configuration as Conf. Each Conf has a present state PS(Conf) and a probability value
called P(Conf), which is the probability of the assertion being in state configuration
Conf. At start-up, there is a single configuration Conf with P(Conf)=1. Given an event
$K_i$ whose $P_i$ is less than 1 (i.e., $K_i$ is hidden), Conf responds to the pair
$\{K_i, P_i\}$ with two configurations called Conf1 and Conf2. The probabilities and
states of Conf1 and Conf2 are calculated as follows:

$P(\mathrm{Conf1}) = P(\mathrm{Conf}) \cdot P_i$ and $P(\mathrm{Conf2}) = 1 - P(\mathrm{Conf1})$

PS(Conf1) is the next state determined by the transition, as if the event fired.

Otherwise, PS(Conf2) is assigned PS(Conf).

Note that two or more configurations that share the same present state are
combined into one configuration, Conf_combined, by summing all participating P(Conf)
probabilities.

A statechart assertion reports the probability of a violation of its corresponding
requirement, also known as the probability of failure (POF) [23], which is the sum of
P(Conf) over all Conf that reach the StateRover error (R4B flag) state.
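
The configuration bookkeeping can be sketched as follows. This Python example is a simplified illustration and not the generated DTRA code or the algorithm of [23] as published: the function name run_assertion and the dictionary-based transition table are hypothetical, and the non-firing branch is given probability P(Conf)·(1 − P_i), which coincides with 1 − P(Conf1) when P(Conf)=1 and is assumed here so that the probabilities of all configurations keep summing to 1.

```python
def run_assertion(transitions, start, flag, inputs):
    """Return the probability of failure (POF) after a list of (event, probability) pairs.

    transitions maps (state, event) to the next state; unlisted pairs leave the state unchanged.
    """
    confs = {start: 1.0}  # present state -> P(Conf); equal states merge in the dict
    for event, p in inputs:
        nxt = {}
        for state, prob in confs.items():
            fired = transitions.get((state, event), state)
            # Conf1: the event fired, with probability P(Conf) * p.
            nxt[fired] = nxt.get(fired, 0.0) + prob * p
            # Conf2: the event did not fire, so the present state is kept.
            nxt[state] = nxt.get(state, 0.0) + prob * (1.0 - p)
        confs = {s: pr for s, pr in nxt.items() if pr > 0.0}
    return confs.get(flag, 0.0)


# Rule-1 shape: a single transition from Init to Flag on event P.
RULE1 = {("Init", "P"): "Flag"}
print(run_assertion(RULE1, "Init", "Flag", [("P", 0.5), ("P", 0.5)]))  # 0.75
```

In the example, the first hidden event reaches the flag state with probability 0.5; the remaining 0.5 stays in Init and contributes another 0.25 on the second event, so the POF is 0.75.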

I. RUNTIME MONITORING 

1. The Alpha Method 

A critical part of the novel RM process used in this thesis is the execution of the
HMM alpha method, detailed below. The outcome of this step is a file called
alpha.json (Figure 29).
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Figure 29. Generating alpha.json File. 


According to [23], the alpha method calculates the HMM’s POD over time, given
the input csv file (Figure 30). More specifically, for every time slot t (i.e., for every row
of the csv file), each HMM state s is assigned an alpha value $\alpha_s(t)$, which is the
probability of the HMM being in state s at time t.
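
One common way to obtain such per-row state probabilities is the normalized forward recursion for HMMs. The Python sketch below shows that recursion under stated assumptions; it is not the thesis toolset, the function name alpha_method is hypothetical, the transition table is assumed to hold probabilities conditioned on the source state, and the two-state model and its numbers are invented for the example.

```python
def alpha_method(states, init, trans, emit, observations):
    """Return one dict per time slot mapping each state to its normalized alpha value."""
    history = []
    prev = None
    for obs in observations:
        if prev is None:
            # First row: initial distribution times emission probability.
            raw = {s: init[s] * emit[s].get(obs, 0.0) for s in states}
        else:
            # Later rows: propagate the previous alphas, then weight by the emission.
            raw = {
                s: emit[s].get(obs, 0.0) * sum(prev[r] * trans[r][s] for r in states)
                for s in states
            }
        total = sum(raw.values())
        if total == 0.0:
            raise ValueError(f"observation {obs!r} has zero probability under the model")
        prev = {s: v / total for s, v in raw.items()}  # normalize so alphas sum to 1
        history.append(prev)
    return history


# Hypothetical two-state model with observables "x" and "y".
STATES = ["M", "S"]
INIT = {"M": 0.5, "S": 0.5}
TRANS = {"M": {"M": 0.5, "S": 0.5}, "S": {"M": 0.5, "S": 0.5}}
EMIT = {"M": {"x": 0.8, "y": 0.2}, "S": {"x": 0.2, "y": 0.8}}

alphas = alpha_method(STATES, INIT, TRANS, EMIT, ["x", "y"])
print([round(a["M"], 3) for a in alphas])  # [0.8, 0.2]
```

The alpha values for a hidden state in each row are what the probabilistic monitor would consume as the POD of the corresponding hidden event.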
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Cf C:\Users\zeyzeyim\Google Drive\new_begin\denemeler\0_denemeHMM_NO_IS_HS.esv - Notepad-*-* 


1 user_created_at,text,followers_count,friends_count 


2 2009-03-13,Super Lol https:\/\/t.co\/lrj3JaAEjn,2946,1506 

3 2009-04-18,GBluthX @joanwalsh How is stating fact an opinion,219,740 

4 2009-06-03,23. I use to be the captain of SCTCC LYM Crew Ireasonsforsteeltoeguest,204,149 

5 2010-01-14,The seats are being filled ahead of the start of the 2016 tMOGOAwards at the Alisa Hotel in Accra, https:\/\/t.co\/0N641 

6 2010-05-15,@staceymurdoughl thanks chick! I'll take pics X,1226,828 

7 2010-05-18,6The_Gatorr get on damn lol,695,2 

8 2010-07-12,@TT_Sisters ©LittleMeThatter @GaryBarlow Very and the video is even cuter x,204,149 

9 2010-07-13,@Theominiking \nYou was absolutely amazing on Theovision \ud83e\uddl7\ud83e\uddl7\ud83e\uddl7\ud83e\uddl7\ud83e\uddl7 ht 

10 2010-07-19,talaye!http,935,24 

11 2010-07-19,Most amazing moment of 2016. Discussing my dissertation with Fatou Bensouda,935,736 

12 2010-07-22,@TRobinsonNewEra @lynbrownmp @Gropeapanda SWestHamLabour This is an Islamic sign and no out cry from liberals but do tl 

13 2010-07-22,You can stop it. Yes . Can stop it .,326,2 

14 2011-02-09,To Six Flags we go!http,927,22 

15 2011-06-18,@Patsydogl And no one can beleeb I is 10., 695,2 

16 2011-08-06,Some people can vote dead Body! Aiyavedied IAMVCA2016,267,246 

17 2011-08-22,2-1 Iggy on shorthanded goal,962,7 

18 2011-08-22,fioufcoli @BurdonGeorge @OOFC_ no jobs •? \ud83d\ude02I'm a copper . Do believe I was working last night .,326,246 

19 2011-10-19,Need to be swooped to the Coliseum,1029,841 

20 2012-03-02,"\""if you look at this painting and all you can see are naked bodies ... the problem is you",1124,833 

21 2012-07-18,all this chisme gots me like https:\/\/t.co\/wrmK3VBdbK,695,1019 

22 2012-09-06,\ud83d\udc97\n\n\ud83d\udc97\n\n\ud83d\udc97\n\n\ud83d\udc97\nl 1 VE NEVER GOTTEN A CALL FROM YOD SO THIS WOULD MEAN A LOT 

23 2012-09-22,@BleuSergy Would be interesting to visualise.,1581,273 

24 2012-09-22,Heading to top 3rd,326,246 

25 2012-11-12,@carriekovarik happy share some screen grabs of @MyDermPortal for your next presentation;),458,684 

26 2012-12-15,It's just a party in Pittsburgh today isn’t it,587,716 

27 2013-01-04,6blackzeusx and I pay tribute to the king. fBrooklyn Inotorious Ihiphop #music #tribute\u2026 https:\/\/t.co\/mhj7A7iT2r 

28 2013-01-16,"@MeLikeGoodMusic thanks for the follow Go check out our single \""Keep It On The Low\""[Prod by Nard &amp; B] https:\/' 

29 2013-04-04, Lol I love em https:\/\/t. co\/U6VZm;)N8HK, 513,408 

30 2013-06-25,Am j doing something wrong,304,294 

31 2013-08-03,I’m next up\ud83d\ude0a\ud83c\udfc8,492,358 

32 2013-08-18,ICallMeSeb 0SEBTSB Help,695,1116 

33 2013-08-29,Sis pops up to LF for 2nd out #RoadiesSB2016,4051,133 

< > 
Normal text file length: 4879 lines: 56 Ln:1 Col:1 SeJ:0|0 Dos\Wmdows UTF-8 INS 



Runtime 



2. Probability of Flag States 

Each row of each rule has a probability value in the range 0-1. This probability 
represents the likelihood of reaching the flag state. Figure 31 shows a list of probabilities 
for Rule 3. For example, while row 7 has a 47% probability of reaching a flag state, the 
probability for row 47 is 100%. The tool thus presents an effective way to deal with 
malicious users and tweets, because defining a threshold and analyzing the data up to 
this threshold can save time and effort. 
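As a minimal sketch of the thresholding idea, the snippet below selects the rows whose flag probability reaches a chosen threshold. The probabilities are the values reported for rows 7 through 12 of Rule 3; the threshold of 0.9 and the function names are hypothetical and only serve the illustration.

```python
def rows_at_or_above(probabilities, threshold):
    """Return the row numbers whose flag probability is >= threshold."""
    return [row for row, p in sorted(probabilities.items()) if p >= threshold]


def first_row_reaching(probabilities, threshold):
    """Return the first row that reaches the threshold, or None."""
    hits = rows_at_or_above(probabilities, threshold)
    return hits[0] if hits else None


# Flag probabilities for rows 7-12 of Rule 3.
rule3 = {
    7: 0.4729757742430315,
    8: 0.4729757742430315,
    9: 0.7603658488632147,
    10: 0.9140943225309589,
    11: 0.9296074288513345,
    12: 0.9742770107832159,
}

THRESHOLD = 0.9  # hypothetical analyst-chosen value
flagged_rows = rows_at_or_above(rule3, THRESHOLD)
first_flagged = first_row_reaching(rule3, THRESHOLD)
print(flagged_rows, first_flagged)
```

With such a filter, an analyst would only need to inspect the tweets in the selected rows rather than the whole input file.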
Row 7: probability of Flag=0.4729757742430315 
Row 8: probability of Flag=0.4729757742430315 
Row 9: probability of Flag=0.7603658488632147 
Row 10: probability of Flag=0.9140943225309589 
Row 11: probability of Flag=0.9296074288513345 
Row 12: probability of Flag=0.9742770107832159 
Row 13: probability of Flag=0.9742770107832159 
Row 14: probability of Flag=0.9905960332250392 
Row 15: probability of Flag=0.996912755448593 
Row 16: probability of Flag=0.996912755448593 
Row 17: probability of Flag=0.998913863436215 
Row 18: probability of Flag=0.998913863436215 
Row 19: probability of Flag=0.9991182773132586 
Row 20: probability of Flag=0.9992909949642688 
Row 21: probability of Flag=0.9992909949642688 
Row 22: probability of Flag=0.9992909949642688 
Row 23: probability of Flag=0.9997553803421885 
Row 24: probability of Flag=0.9997553803421885 
Row 25: probability of Flag=0.9997553803421885 
Row 26: probability of Flag=0.9998024710640785 
Row 27: probability of Flag=0.9999345420618497 
Row 28: probability of Flag=0.9999797268849568 
Row 29: probability of Flag=0.9999937756946659 
Row 30: probability of Flag=0.9999937756946659 
Row 31: probability of Flag=0.9999937756946659 
Row 32: probability of Flag=0.9999937756946659 
Row 33: probability of Flag=0.9999979169506338 
Row 34: probability of Flag=0.9999993648710289 
Row 35: probability of Flag=0.9999993648710289 
Row 36: probability of Flag=0.9999993648710289 
Row 37: probability of Flag=0.9999997892605491 
Row 38: probability of Flag=0.9999999362467353 
Row 39: probability of Flag=0.9999999362467353 
Row 40: probability of Flag=0.9999999362467353 
Row 41: probability of Flag=0.9999999362467353 
Row 42: probability of Flag=0.9999999362467353 
Row 43: probability of Flag=0.9999999362467353 
Row 44: probability of Flag=0.9999999362467353 
Row 45: probability of Flag=0.9999999362467353 
Row 46: probability of Flag=0.9999999362467353 
Row 47: probability of Flag=1.0 
Row 48: probability of Flag=1.0 
Row 49: probability of Flag=1 


A list of probability values, one per cycle (CSV file row). Each value is the probability 
of the monitor reaching the Flag state in that cycle. 


Figure 31. Runtime Monitoring Rule 3 
IV. CONCLUSIONS 


A. SUMMARY 

In this thesis, we demonstrated a new technique for performing RM with hidden data. 
The purpose of the thesis was to determine whether using such a technique for the 
detection of malicious tweets and users has the potential to detect patterns of interest 
more efficiently than existing approaches. 

This new technique relies on a tool that can handle datasets containing 
non-observable data, which makes it more effective than tools lacking that capability. 
In addition, the technique uses English specifications as the starting point, yet avoids 
ambiguity (through the underlying formal specifications) and supports visual debugging. 
An example of an English starting-point rule is “Flag whenever event Q occurs fewer 
than N times between events P and R.” 
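To make the example rule concrete, the following is a minimal sketch of a deterministic monitor for "Flag whenever event Q occurs fewer than N times between events P and R." It is not the statechart or the generated code used in the thesis. The class name and the decision that a repeated P restarts the count are assumptions made for illustration.

```python
class FewerThanNBetween:
    """Flags when fewer than n Q events occur between a P and the next R."""

    def __init__(self, n):
        self.n = n
        self.counting = False  # True after a P, until the next R
        self.count = 0         # number of Q events seen since the last P
        self.flagged = False

    def on_event(self, name):
        if name == "P":
            # Assumption: a new P restarts the count.
            self.counting = True
            self.count = 0
        elif name == "Q" and self.counting:
            self.count += 1
        elif name == "R" and self.counting:
            if self.count < self.n:
                self.flagged = True
            self.counting = False
        return self.flagged


def is_flagged(events, n):
    """Feed a sequence of event names to a fresh monitor."""
    monitor = FewerThanNBetween(n)
    for event in events:
        monitor.on_event(event)
    return monitor.flagged


print(is_flagged(["P", "Q", "R"], n=1))       # one Q between P and R
print(is_flagged(["P", "Q", "P", "R"], n=1))  # the second P restarts the count
```

Once raised, the flag stays set in this sketch; a monitor that resets after each P-R interval would be an equally reasonable design.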

The technique can be used for pattern detection in many different domains, 
such as detection of fraudulent credit card transactions, traffic light controllers, 
automated border security and warning systems, and detection of malicious emails, 
tweets, and messages. 

Determining the NL assertions and converting them into a corresponding formal 
specification language is a problematic area in software engineering. In our technique, 
UML-based statecharts are intuitive and simple, and require little learning effort. 
Automated code generation in the StateRover phase allows the tool to combine 
validation and monitoring of data. Simple implementation, domain independence, and 
automated code generation with runtime monitoring are the features that differentiate it 
from other tools. 

In social media, although there is voluminous data flow, it is still possible to 
create an effective system that can detect malicious activities in a short time and provide 
situational awareness. The important part of the work for a reliable system is to specify 
event sequences that indicate malicious activity. Considering event sequences allows the 
system to draw more precise inferences. In this thesis, we specified some event patterns 
indicating malicious activities according to [6]. 

Finally, there is no magic system for detecting malicious content on Twitter or other 
social media platforms. However, approaches such as this technique can be used to 
create new systems that provide better situational awareness. 

B. FUTURE WORK 

Social media analysis is very popular and there are many opportunities for 
extending the scope of this thesis. 

In this thesis, we used the relationship between malicious users and six attributes 
of tweets. These attributes, which are only a small part of all available attributes, are the 
creation date of the account, the number of friends, the number of followers, whether 
geolocation is enabled, whether the account is verified, and the tweet text. Adding more 
attributes could yield more accurate results. What is necessary is to find different 
indicators of malicious content and user behavior, and then use them with related 
attributes in the workflow. 

Malicious content and users can be subclassified into categories such as “terrorist” 
and “fraudulent behavior.” Because these categories can have different indicators, 
future studies could focus on either one. 
APPENDIX. SCREENSHOTS FROM WORK FLOW 


JUNIT SANITY TESTS 
1. Rule-1 



    package Rule1;

    //import static org.junit.Assert.*;

    import junit.framework.TestCase;
    import org.junit.Before;
    import org.junit.Test;

    public class SanityTest {
        Rule1 rule;

        @Before
        public void setup() {
            rule = new Rule1();
        }

        @Test
        public void test() {
            rule.P();
            TestCase.assertFalse(rule.isSuccess());
        }

        @Test
        public void test1() {
            rule.timeoutFire();
            TestCase.assertTrue(rule.isSuccess());
        }
    }


Figure 32. Rule 1 Sanity Test. 









2. Rule-3 

    package Rule3;

    //import static org.junit.Assert.*;

    import junit.framework.TestCase;
    import org.junit.Before;
    import org.junit.Test;

    public class SanityTest {
        Rule3 rule;

        @Before
        public void setup() {
            rule = new Rule3();
        }

        @Test
        public void test() {
            rule.P();
            rule.Q_and_notR();
            rule.P();
            rule.R();
            TestCase.assertFalse(rule.isSuccess());
        }

        @Test
        public void test1() {
            rule.P();
            rule.Q_and_notR();
            rule.R();
            TestCase.assertTrue(rule.isSuccess());
        }
    }


Figure 33. Rule 3 Sanity Test. 


3. Rule-9 


    package Rule9;

    //import static org.junit.Assert.*;

    import junit.framework.TestCase;
    import org.junit.Before;
    import org.junit.Test;

    public class SanityTest {
        Rule9 rule;

        @Before
        public void setup() {
            rule = new Rule9();
            rule.T = 10;
            rule.execTRreset(); // enforce the setting of T
        }

        @Test
        public void test() {
            rule.incrTime(5);
            rule.E();
            rule.incrTime(15);
            rule.E();
            TestCase.assertTrue(rule.isSuccess()); // E's are more than 30 units apart
        }

        @Test
        public void test1() {
            rule.incrTime(3);
            rule.E();
            rule.incrTime(5);
            rule.E();
            TestCase.assertFalse(rule.isSuccess()); // E's are less than 30 units apart
        }
    }


Figure 34. Rule 9 Sanity Test. 
4. Rule-19 



Figure 35. Rule 19 Sanity Test. 


5. Rule-21 


    package Rule21;

    //import static org.junit.Assert.*;

    import junit.framework.TestCase;
    import org.junit.Before;
    import org.junit.Test;

    public class SanityTest {
        Rule21 rule;

        @Before
        public void setup() {
            rule = new Rule21();
            rule.N = 1;
        }

        @Test
        public void test() {
            rule.E();
            rule.Q_and_notE();
            rule.E();
            TestCase.assertTrue(rule.isSuccess());
        }

        @Test
        public void test1() {
            rule.E();
            rule.Q_and_notE();
            rule.Q_and_notE();
            rule.E();
            TestCase.assertFalse(rule.isSuccess());
        }
    }


Figure 36. Rule 21 Sanity Test. 
D. PYTHON QUANTIZATION SCRIPTS 

# quantizeUserCreated.py
"""
Script for quantizing the user_created_at column.
The user_created_at column data is provided as an argument: one string separated ...
The Java caller prepares the data this way
The output values are printed to stdout one line at a time
"""
import sys
import os
import datetime

args = sys.argv
#args = ['/Users/PythonQuantizationScripts/quantizeUserCreated.py', '2016-...
if len(args) != 2:
    print("CallError: expecting two arguments (path to this script and a string ...")
    sys.exit(0)

# make a list by splitting the string with _
cells = args[1].split("_")

# quantization
for icell in cells:
    # in my case this cell includes a date in year-month-day format
    year, month, day = icell.split("-")
    present = datetime.datetime.now()
    created_at = datetime.datetime.strptime(icell, '%Y-%m-%d')
    icell2 = str(int(year) + 2) + '-' + month + '-' + day
    # after2year represents the date exactly two years after the creation of the ...
    after2year = datetime.datetime.strptime(icell2, '%Y-%m-%d')
    # If an account was created more than 2 years ago, it is OLD.
    # If an account was created less than 2 years ago, it is NEW.
    if created_at > present:
        print("CallError: Account should not be created in future.")
        continue
    elif present > after2year:
        user_created_at = "OLD"
    else:
        user_created_at = "NEW"
    print(user_created_at)


Figure 37. Quantize user_created_at Column. 
# quantizeFollowers.py
"""
Script for quantizing the followers_count column.
The followers_count column data is provided as an argument: one string separated ...
Don't worry about this part, the Java caller prepares the data this way
The output values are printed to stdout one line at a time
"""
import sys
import os

args = sys.argv
#args = ['/Users/PythonQuantizationScripts/quantizeUserCreated.py', '20_505_2005 ...
if len(args) != 2:
    print("CallError: expecting two arguments (path to this script and a string ...")
    sys.exit(0)

# make a list by splitting the string with _
cells = args[1].split("_")

# quantization
followers_count = 0
for cell in cells:
    icell = int(cell)
    # in my case, this column displays the number of followers. So there is no ne ...
    if icell < 0:
        print("CallError: number of followers cannot be less than zero.")
        continue
    elif icell < 500:
        followers_count = "followers_countLT500"
    elif icell <= 20000:
        followers_count = "followers_count500to20000"
    else:
        followers_count = "followers_countGT20000"
    print(followers_count)


Figure 38. Quantize followers_count Column. 
# quantizeFriends.py
"""
Script for quantizing the friends_count column.
The friends_count column data is provided as an argument: one string separated b ...
The Java caller prepares the data this way
The output values are printed to stdout one line at a time
"""
import sys
import os

args = sys.argv
#args = ['/Users/PythonQuantizationScripts/quantizeUserCreated.py', '20_505_2005 ...
if len(args) != 2:
    print("CallError: expecting two arguments (path to this script and a string ...")
    sys.exit(0)

# make a list by splitting the string with _
cells = args[1].split("_")

# quantization
friends_count = 0
for cell in cells:
    # ***** THIS IS WHERE YOU MAKE CHANGES TO THE CODE TO REFLECT YOUR QUANTIZAT ...
    icell = int(cell)
    # in my case, this column displays the number of friends. So there is no neg ...
    if icell < 0:
        print("CallError: number of friends cannot be less than zero.")
        continue
    elif icell < 500:
        friends_count = "friends_countLT500"
    elif icell <= 1000:
        friends_count = "friends_count500to1000"
    else:
        friends_count = "friends_countGT1000"
    print(friends_count)


Figure 39. Quantize friends_count Column. 
# quantization.properties
# Use any name for the script name, but l.h.s must match column name in csv file

# The "followers_count" column quantization (the followers_count column in my csv) script is quantizeFollowers.py
followers_count=quantizeFollowers.py

# The "friends_count" column quantization (the friends_count column in my csv) script is quantizeFriends.py
friends_count=quantizeFriends.py

# The "user_created_at" column quantization (the user_created_at column in my csv) script is quantizeUserCreated.py
user_created_at=quantizeUserCreated.py

# Note that quantization.properties has Python modules only for three columns (followers_count, friends_count, and user_created_at);
# Therefore, other columns will not be used as HMM outputs.
# Stated differently, this is as if the user is saying that the classification of hidden states depends on
# those three columns and not the others.


Figure 40. Quantization Properties File. 
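The role of the properties file can be illustrated with a short sketch that reads the column-to-script mapping and determines which csv columns would serve as HMM outputs. In the actual workflow this step is performed by the Java caller; the parser below, including the function name parse_properties, is a hypothetical stand-in that handles only simple key=value lines and # comments.

```python
def parse_properties(text):
    """Parse simple 'key=value' lines, skipping blank lines and # comments."""
    mapping = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected key=value, got: " + line)
        mapping[key.strip()] = value.strip()
    return mapping


# The three entries of the properties file.
properties_text = """
# l.h.s must match column name in csv file
followers_count=quantizeFollowers.py
friends_count=quantizeFriends.py
user_created_at=quantizeUserCreated.py
"""

scripts = parse_properties(properties_text)

# Header of the runtime csv file.
header = ["user_created_at", "text", "followers_count", "friends_count"]

# Only columns with a quantization script are used as HMM outputs.
hmm_output_columns = [column for column in header if column in scripts]
print(hmm_output_columns)
```

The text column has no entry in the properties file, so it is left out of the HMM outputs, in line with the note at the end of the file.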
E. SANITY TESTS FOR PROBABILISTIC RUNTIME VERIFICATION 


1. Rule-1 Sanity Test for DTRA_Rules 

    package Rule1;
    //import static org.junit.Assert.*;
    import junit.framework.TestCase;
    import org.junit.Before;
    import org.junit.Test;

    public class SanityTest {
        Rule1_DTRA rule;

        @Before
        public void setup() {
            rule = new Rule1_DTRA();
        }

        @Test
        public void test() {
            rule.P(1.0);
            double d = rule.getProbabilityOfSuccess();
            System.out.println("d=" + d);
            TestCase.assertFalse(rule.isSuccess());
            // probability 0 of success means probability 1 of flagging: FLAG
            TestCase.assertEquals(1-d, 1.0);
        }

        @Test
        public void test1() {
            rule.timeoutFire(1.0);
            double d = rule.getProbabilityOfSuccess();
            System.out.println("in test: d=" + d);
            TestCase.assertTrue(rule.isSuccess());
            // probability 1 of success means probability 0 of flagging: NO FLAG
            TestCase.assertEquals(1-d, 0.0);
        }
    }

Figure 41. Sanity Test for Rule1_DTRA. 
2. Rule-3 Sanity Test for DTRA_Rules 

    package Rule3;
    //import static org.junit.Assert.*;
    import junit.framework.TestCase;
    import org.junit.Before;
    import org.junit.Test;

    public class SanityTest {
        Rule3_DTRA rule;

        @Before
        public void setup() {
            rule = new Rule3_DTRA();
        }

        @Test
        public void test() {
            rule.P(1.0);
            rule.R(1.0);
            double d = rule.getProbabilityOfSuccess();
            System.out.println("in test: d=" + d);
            TestCase.assertFalse(rule.isSuccess());
            // probability 0 of success means probability 1 of flagging
            TestCase.assertEquals(1-d, 1.0);
        }

        @Test
        public void test1() {
            rule.P(1.0);
            rule.Q_and_notR(1.0);
            rule.P(1.0);
            rule.Q_and_notR(1.0);
            rule.R(1.0);
            double d = rule.getProbabilityOfSuccess();
            System.out.println("in test: d=" + d);
            // ...
        }
    }


Figure 42. Sanity Test for Rule3_DTRA. 
3. Rule-9 Sanity Test for DTRA_Rules 

        // ...
        public void setup() {
            rule = new Rule9_DTRA();
            rule.T = 10;
            rule.execTRreset(); // enforce the setting of T
        }

        @Test
        public void test() {
            rule.incrTime(5);
            rule.E(1.0); // add probability 1 to events
            rule.incrTime(15);
            rule.E(1.0); // add probability 1 to events
            double d = rule.getProbabilityOfSuccess();
            System.out.println("in test: d=" + d);
            // E's are more than 30 units apart
            TestCase.assertTrue(rule.isSuccess());
            // probability 1 of success means probability 0 of flagging
            TestCase.assertEquals(1-d, 0.0);
        }

        @Test
        public void test1() {
            rule.incrTime(3);
            rule.E(1.0); // add probability 1 to events
            rule.incrTime(5);
            rule.E(1.0); // add probability 1 to events
            double d = rule.getProbabilityOfSuccess();
            System.out.println("in test: d=" + d);
            // E's are less than 30 units apart
            TestCase.assertFalse(rule.isSuccess());
            // probability 0 of success means probability 1 of flagging
            TestCase.assertEquals(1-d, 1.0);
        }
        // ...


Figure 43. Sanity Test for Rule9_DTRA. 
4. Rule-19 Sanity Test for DTRA_Rules 

        // ...
        Rule19_DTRA rule;

        @Before
        public void setup() {
            rule = new Rule19_DTRA();
            rule.N = 2;
            rule.T = 10;
            rule.execTRreset(); // enforce the setting of T
        }

        @Test
        public void test() {
            rule.incrTime(1);
            rule.Q(1.0); // add probability 1 to events
            rule.incrTime(1);
            rule.E(1.0);
            rule.incrTime(1);
            rule.E(1.0);
            rule.E(1.0);
            rule.E(1.0);
            rule.incrTime(8);
            rule.timeoutFire(1.0);
            double d = rule.getProbabilityOfSuccess();
            System.out.println("d=" + d);
            TestCase.assertFalse(rule.isSuccess());
            // probability 0 of success means probability 1 of flagging: FLAG
            TestCase.assertEquals(1-d, 1.0);
        }

        @Test
        public void test1() {
            rule.Q(1.0);
            rule.incrTime(3);
            // ...
        }

Figure 44. Sanity Test for Rule19_DTRA. 
5. Rule-21 Sanity Test for DTRA_Rules 

    // ...
    public class SanityTest {
        Rule21_DTRA rule;

        @Before
        public void setup() {
            rule = new Rule21_DTRA();
            rule.N = 1;
        }

        @Test
        public void test() {
            rule.E(1.0);
            rule.Q_and_notE(1.0);
            rule.E(1.0);
            double d = rule.getProbabilityOfSuccess();
            System.out.println("in test: d=" + d);
            TestCase.assertTrue(rule.isSuccess());
            // probability 1 of success means probability 0 of flagging
            TestCase.assertEquals(1-d, 0.0);
        }

        @Test
        public void test1() {
            rule.E(1.0);
            rule.Q_and_notE(1.0);
            rule.Q_and_notE(1.0);
            rule.E(1.0);
            double d = rule.getProbabilityOfSuccess();
            System.out.println("in test: d=" + d);
            TestCase.assertFalse(rule.isSuccess());
            // probability 0 of success means probability 1 of flagging
            TestCase.assertEquals(1-d, 1.0);
        }
    }


Figure 45. Sanity Test for Rule21_DTRA. 
F. COMMANDS IN CMD 

Table 7. Commands. 

Action: Generating hmm.json file 
Command: java -jar dtrahmm.jar denemeler\0_denemeHMM_IS_HS.csv 
    Argument-1: learning phase csv file 
    Argument-2: folder of quantization.properties files 

Action: Run the alpha method 
Command: java -jar dtraalpha.jar denemeler\0_denemeHMM_NO_IS_HS.csv ... PythonQuantizationScripts 
    Argument-1: runtime csv file 
    Argument-2: hmm.json file 
    Argument-3: folder of quantization.properties files 

Action: Runtime monitoring 
Command: java -jar dtrarm.jar denemeler\0_denemeHMM_NO_IS_HS.csv denemeler\alpha.json DTRA_Rules.jar Rule3_DTRA Rules\bin\Rule3\Rule3_events.properties 
    Argument-1: runtime csv file 
    Argument-2: path to alpha.json file 
    Argument-3: path to DTRA_Rules.jar file 
    Argument-4: the rule we want to monitor 
    Argument-5: path to events.properties files 

Color code for arguments: Argument-1, Argument-2, Argument-3, Argument-4 